
    Query-Time Data Integration

    Today, data is collected at ever-increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is not even clear if, and in what context, the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced by fully automated retrieval and mapping methods is compensated for by answering those queries with ranked lists of alternative results, each based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions form the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
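    To make the Open World SQL idea concrete, the following minimal sketch (with an invented table, invented sources, and simplified logic; not the actual DrillBeyond implementation) shows a query referencing an attribute the local schema lacks, answered with a list of alternative augmentations:

```python
# Hypothetical sketch of an Open World SQL query: "population" is not in
# the local schema, so the system augments the table at query time and
# returns alternative results. Names and data are invented; this is not
# the DrillBeyond implementation.

local_table = [{"name": "Dresden", "country": "DE"},
               {"name": "Leipzig", "country": "DE"}]
local_schema = {"name", "country"}

# SELECT name, population FROM cities WHERE country = 'DE'
requested = {"name", "population"}
open_attrs = requested - local_schema            # -> {"population"}

# stand-in for Web-table search: alternative sources for the attribute
candidate_sources = [
    {"Dresden": 556000, "Leipzig": 601000},      # source A
    {"Dresden": 554649, "Leipzig": 597493},      # source B
]

# one alternative query result per consistent augmentation
alternatives = []
for source in candidate_sources:
    augmented = [dict(row, population=source.get(row["name"]))
                 for row in local_table]
    alternatives.append(augmented)

for rank, result in enumerate(alternatives, 1):
    print(f"alternative {rank}: {result}")
```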

    A Domain-Specific Language for Do-It-Yourself Analytical Mashups

    The increasing amount and variety of data available on the web leads to new possibilities in end-user-focused data analysis. While classic database technologies for data integration and analysis (ETL and BI) are too complex for the needs of end users, newer technologies such as web mashups are not well suited to data analysis. To make productive use of the data available on the web, end users need easy ways to find, join, and visualize it. We propose a domain-specific language (DSL) for querying a repository of heterogeneous web data. In contrast to query languages such as SQL, this DSL describes the visualization of the queried data in addition to the selection, filtering, and aggregation of the data. The resulting data mashup can be made interactive by leaving parts of the query variable. We also describe an abstraction layer above this DSL that uses a recommendation-driven natural language interface to reduce the difficulty of creating queries in this DSL.
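    The paper's concrete DSL syntax is not reproduced here, but the following hypothetical sketch, modeled as a fluent Python API, illustrates the core idea: the query declares not only selection, grouping, and aggregation, but also the visualization, plus a variable part left open for interaction:

```python
# Hypothetical mashup-DSL query, sketched as a fluent Python API. The
# actual DSL syntax differs; this only illustrates that visualization
# and variable (interactive) parts belong to the query itself.

class Query:
    def __init__(self, source):
        self.parts = {"source": source}
    def where(self, cond):     self.parts["filter"] = cond;   return self
    def variable(self, name):  self.parts["variable"] = name; return self
    def group_by(self, col):   self.parts["group"] = col;     return self
    def aggregate(self, expr): self.parts["agg"] = expr;      return self
    def chart(self, kind):     self.parts["chart"] = kind;    return self

q = (Query("repository://energy_prices")     # invented dataset name
     .where("country = ?")                   # '?' is bound by a UI widget
     .variable("country")
     .group_by("year")
     .aggregate("avg(price)")
     .chart("line"))                         # visualization is declared too

print(q.parts)
```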

    Frontiers in Crowdsourced Data Integration

    There is an ever-increasing amount and variety of open web data available that is insufficiently examined or not considered at all in decision-making processes. This is due to the lack of end-user-friendly tools that help to reuse this public data and to create knowledge out of it. We therefore propose a schema-optional data repository that provides the flexibility necessary to store and gradually integrate heterogeneous web data. Based on this repository, we propose a semi-automatic schema enrichment approach that efficiently augments the data in a “pay-as-you-go” fashion. To handle the ambiguities that inherently arise, we further propose a crowd-based verification component that is able to resolve such conflicts in a scalable manner.
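    As a rough illustration of the crowd-based verification idea, the sketch below resolves ambiguous schema-mapping candidates by majority vote over worker answers; the task format, agreement threshold, and data are assumptions, not the system's actual design:

```python
# Illustrative majority-vote verification of ambiguous schema mappings.
# Task format, votes, and the agreement threshold are assumptions.

from collections import Counter

# each task asks: does column "col_3" of the new dataset mean this attribute?
crowd_votes = {
    ("col_3", "population"): ["yes", "yes", "no", "yes", "yes"],
    ("col_3", "area_km2"):   ["no", "no", "yes", "no", "no"],
}

def verify(votes, min_agreement=0.6):
    """Accept a mapping only if enough workers agree it is correct."""
    accepted = {}
    for mapping, answers in votes.items():
        top, count = Counter(answers).most_common(1)[0]
        if top == "yes" and count / len(answers) >= min_agreement:
            accepted[mapping] = count / len(answers)
    return accepted

print(verify(crowd_votes))   # -> {('col_3', 'population'): 0.8}
```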

    Top-k Entity Augmentation using Consistent Set Covering

    Entity augmentation is a query type in which, given a set of entities and a large corpus of possible data sources, the values of a missing attribute are to be retrieved. State-of-the-art methods return a single result that, to cover all queried entities, is fused from a potentially large set of data sources. We argue that queries on large corpora of heterogeneous sources using information retrieval and automatic schema matching methods cannot easily return a single result that the user can trust, especially if the result is composed from a large number of sources that the user has to verify manually. We therefore propose to process these queries in a top-k fashion, in which the system produces multiple minimal consistent solutions from which the user can choose, in order to resolve the uncertainty of the data sources and methods used. In this paper, we introduce and formalize the problem of consistent, multi-solution set covering, and present algorithms based on a greedy and a genetic optimization approach. We then apply these algorithms to Web table-based entity augmentation. The publication further includes a Web table corpus with 100M tables, and a Web table retrieval and matching system in which these algorithms are implemented. Our experiments show that the consistency and minimality of the augmentation results can be improved using our set covering approach, without loss of precision or coverage, and while producing multiple alternative query results.
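    The following toy sketch illustrates the greedy side of the multi-solution set covering idea: it builds k complete covers, penalizing sources already used in earlier solutions so the alternatives stay mutually diverse, while the greedy gain keeps each individual cover small. The data, penalty, and scoring are simplified assumptions, not the paper's actual algorithm:

```python
# Toy greedy sketch of top-k consistent set covering: each source covers
# some queried entities; we build k complete covers, penalizing sources
# reused across solutions to keep the k results diverse.

from collections import Counter

entities = {"A", "B", "C", "D"}
sources = {                    # source -> entities it provides values for
    "s1": {"A", "B"}, "s2": {"C", "D"}, "s3": {"A", "B", "C"},
    "s4": {"D"}, "s5": {"A", "B", "C", "D"},
}

def greedy_cover(use_count, penalty=2.0):
    cover, uncovered = [], set(entities)
    while uncovered:
        gain = lambda s: len(sources[s] & uncovered)
        # prefer fresh sources, but never at the cost of completeness
        best = max(sources, key=lambda s: gain(s) - penalty * use_count[s])
        if gain(best) == 0:
            best = max(sources, key=gain)   # fall back to reused sources
            if gain(best) == 0:
                break                       # some entity nobody covers
        cover.append(best)
        uncovered -= sources[best]
    return cover

use_count, solutions = Counter(), []
for _ in range(3):                          # top-3 alternative covers
    cover = greedy_cover(use_count)
    solutions.append(cover)
    use_count.update(cover)

print(solutions)  # [['s5'], ['s3', 's2'], ['s1', 's4', 's2']]
```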

    OPEN—Enabling Non-expert Users to Extract, Integrate, and Analyze Open Data

    Government initiatives for more transparency and participation have led to an increasing amount of structured data on the web in recent years. Many of these datasets have great potential. For example, a situational analysis and meaningful visualization of the data can assist in pointing out social or economic issues and raising people’s awareness. Unfortunately, the ad-hoc analysis of this so-called Open Data can prove very complex and time-consuming, partly due to a lack of efficient system support. On the one hand, search functionality is required to identify relevant datasets. Common document retrieval techniques used in web search, however, are not optimized for Open Data and do not address the semantic ambiguity inherent in it. On the other hand, semantic integration is necessary to perform analysis tasks across multiple datasets. To do so in an ad-hoc fashion, however, requires more flexibility and easier integration than most data integration systems provide. It is apparent that an optimal management system for Open Data must combine aspects from both classic approaches. In this article, we propose OPEN, a novel concept for the management and situational analysis of Open Data within a single system. In our approach, we extend a classic database management system, adding support for the identification and dynamic integration of public datasets. As most web users lack the experience and training required to formulate structured queries in a DBMS, we add support for non-expert users to our system, for example through keyword queries. Furthermore, we address the challenge of indexing Open Data.
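    As a small illustration of one ingredient such a system needs, the hypothetical sketch below builds a keyword index over dataset metadata so that non-expert users can locate relevant datasets without structured queries; it is a simplification, not the OPEN indexing approach itself:

```python
# Hypothetical keyword index over dataset descriptions; a simplification,
# not the actual OPEN system. Datasets are ranked by matched query terms.

from collections import defaultdict

datasets = {
    "ds1": "municipal budget spending 2014 dresden",
    "ds2": "school enrollment statistics saxony",
    "ds3": "public spending transparency report",
}

index = defaultdict(set)                 # term -> datasets containing it
for ds_id, description in datasets.items():
    for term in description.split():
        index[term].add(ds_id)

def keyword_search(query):
    """Rank datasets by how many query terms they match."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for ds_id in index.get(term, ()):
            hits[ds_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(keyword_search("public spending"))   # -> ['ds3', 'ds1']
```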

    Column-specific Context Extraction for Web Tables

    Relational Web tables have become an important resource for applications such as factual search and entity augmentation. A major challenge for the automatic identification of relevant tables on the Web is the fact that many of these tables have missing or non-informative column labels. Research has focused largely on recovering the meaning of columns by inferring class labels from the instances using external knowledge bases. The table context, which often contains additional information on the table's content, is frequently considered as an indicator for the general content of a table, but not as a source for column-specific details. In this paper, we propose a novel approach to identify and extract column-specific information from the context of Web tables. In our extraction framework, we consider different techniques to extract directly as well as indirectly related phrases. We perform a number of experiments on Web tables extracted from Wikipedia. The results show that column-specific information extracted using our simple heuristic significantly boosts precision and recall for table and column search.
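    In the spirit of the described heuristic (though not its actual implementation), the following sketch scores phrases from a table's surrounding text by their term overlap with a column's label and values, keeping the best-matching phrase as additional column description; the scoring function and data are illustrative assumptions:

```python
# Illustrative column-specific context scoring: rank context phrases by
# term overlap with the column's label and values. Not the paper's method.

def column_terms(label, values):
    """Collect lowercase terms from the column label and its values."""
    terms = set(label.lower().split())
    for v in values:
        terms.update(str(v).lower().split())
    return terms

def score_phrase(phrase, terms):
    """Fraction of the phrase's words that also describe the column."""
    words = set(phrase.lower().split())
    return len(words & terms) / len(words)

column_label = "Height"
column_values = ["828 m", "634 m", "601 m"]
context_phrases = [
    "list of tallest buildings by height in m",
    "last updated in January 2016",
    "height measured to the architectural top",
]

terms = column_terms(column_label, column_values)
ranked = sorted(context_phrases,
                key=lambda p: score_phrase(p, terms), reverse=True)
print(ranked[0])   # the most column-specific context phrase
```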

    Towards a Hybrid Imputation Approach Using Web Tables

    Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With new emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further reinforced. While the process of filling in missing values is traditionally addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on these, chooses the best imputation approach, i.e., either a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus with 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and has high coverage at the same time.
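    A minimal sketch of the hybrid decision, with invented features and thresholds rather than the paper's actual model: the strategy inspects simple characteristics of the incomplete dataset and picks regression, Web lookup, or a combination:

```python
# Illustrative decision rule for a hybrid imputation strategy; the
# features and thresholds are invented, not the paper's actual model.

def choose_imputation(numeric, local_correlation, has_key_entities):
    """Pick a strategy from simple traits of the incomplete column."""
    if numeric and local_correlation > 0.8:
        return "regression"       # value predictable from local columns
    if has_key_entities and numeric and local_correlation > 0.4:
        return "hybrid"           # combine Web lookup with regression
    if has_key_entities:
        return "web_lookup"       # fetch values from Web tables by entity
    return "regression"           # nothing to look entities up by

# e.g. a numeric 'population' column of named cities, weakly correlated
# with the dataset's other columns:
print(choose_imputation(True, 0.3, True))    # -> web_lookup
```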

    Publish-Time Data Integration for Open Data Platforms

    Platforms for publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, ontology, or even just common publication standards. This results in inconsistent names for attributes of the same meaning, which constrains the discovery of relationships between datasets as well as their reusability. Existing data integration techniques focus on reuse-time, i.e., they are applied when a user wants to combine a specific set of datasets or integrate them with an existing database. In contrast, this paper investigates a novel method of data integration at publish-time, where the publisher is provided with suggestions on how to integrate the new dataset with the corpus as a whole, without resorting to a manually created mediated schema or ontology for the platform. We propose data-driven algorithms that suggest alternative attribute names for a newly published dataset, based on attribute and instance statistics maintained on the corpus. We evaluate the proposed algorithms using real-world corpora based on the Open Data platform opendata.socrata.com and relational data extracted from Wikipedia. We report on the system's response time, and on the results of an extensive crowdsourcing-based evaluation of the quality of the generated attribute name alternatives.
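    The following sketch illustrates the publish-time suggestion idea with a Jaccard-style comparison of a new column's values against per-attribute value statistics maintained on the corpus; the scoring and data are illustrative assumptions, not the paper's algorithms:

```python
# Illustrative attribute-name suggestion at publish time: compare a new
# column's values against value sets seen per attribute in the corpus.
# Jaccard scoring and all data are assumptions, not the paper's method.

corpus_stats = {                 # attribute name -> values seen under it
    "country":      {"germany", "france", "poland", "spain"},
    "nation":       {"germany", "france", "usa"},
    "municipality": {"dresden", "leipzig"},
}

def suggest_names(new_values, k=2):
    """Return the k corpus attribute names whose values best match."""
    new_values = {v.lower() for v in new_values}
    def jaccard(vals):
        return len(vals & new_values) / len(vals | new_values)
    ranked = sorted(corpus_stats, key=lambda a: jaccard(corpus_stats[a]),
                    reverse=True)
    return ranked[:k]

# a publisher uploads a column labelled "ctry" with these values:
print(suggest_names({"Germany", "France", "Spain"}))  # ['country', 'nation']
```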